ABSTRACT

In this paper we present a parallel for-loop scheduler which
is based on work-stealing principles but runs under a completely
cooperative scheme. POSIX signals are used by idle
threads to interrupt left-behind workers, which in turn decide
what portion of their workload can be given to the requester.
We call this scheme Interrupt-Driven Work-Sharing
(IDWS). This article describes how IDWS works, how it can
be integrated into any POSIX-compliant OpenMP implementation
and how a user can manually replace OpenMP
parallel for-loops with IDWS in existing POSIX-compliant
C++ applications. Additionally, we measure its performance
using both a synthetic benchmark with varying distributions
of workload across the iteration space and a real-life
application on Sandy Bridge and Xeon Phi systems.
Regardless of the workload distribution and the underlying
hardware, IDWS is always the best or among the best-performing
strategies, providing a good all-around solution
to the scheduling-choice dilemma.
1. INTRODUCTION

Most parallelism in shared-memory parallel programming
comes from loops of independent iterations, i.e. iterations
which can be safely executed in parallel. However, distributing
the iteration space over the available computational resources
of a system is not always simple. Fine-grained
control of distribution is often associated with high
overhead whereas static partitioning of the iteration space
can lead to significant load imbalance. In both cases, the
impact on performance is serious.

Research on an advanced for-loop scheduler was motivated
by our work on PRAgMaTIc [TS], a hybrid OpenMP/MPI
mesh adaptivity framework. Profiling data revealed that
many of PRAgMaTIc's parallel loops are highly diverse, involving
irregular computations which introduce high levels
of iteration-to-iteration load imbalance. Existing scheduling
strategies provided by OpenMP fail to achieve good balance
with low scheduling overhead, whereas adaptive mesh algorithms
which constantly modify mesh topology make it
impossible to balance workload a priori.

We wanted the new scheduler to be portable and easily
pluggable into the widely-adopted OpenMP API, so that it
can target as wide a range of systems as possible, like Fujitsu's
FX-10, a SPARC64-based supercomputer [^. Those
portability requirements prohibit the use of platform- or
vendor-specific threading mechanisms and parallel libraries,
like Intel® Cilk™ Plus [HI [3 [3]. On the contrary, they call
for a POSIX-compliant implementation, based on the fact
that most operating systems used in scientific computing are
POSIX-compliant and most compilers (e.g. Linux versions
of gcc, icc, xlc, etc.) implement OpenMP threads as POSIX
threads (as we found by experimenting with those
compilers). Of course, since every OS has threading and signalling
mechanisms, the new scheduler can be implemented
in any compiler on any OS.

The main contributions of this article are the following: 

• Present a new interrupt-driven work-sharing scheduler
(IDWS) which can easily be used with existing
POSIX-compliant OpenMP applications

• Demonstrate how OpenMP loops can be converted to
IDWS loops

• Describe how a compiler vendor can incorporate the
new scheduler into their product

• Show using a variety of benchmarks that IDWS is
a good all-around solution to the scheduling-choice
dilemma, always being among the best-performing strategies
in all benchmarks


The rest of this paper is organized as follows: Section 2
provides an overview of loop scheduling options currently
available, listing their advantages and weaknesses. In Section 3
we describe the new scheduler and the way it works, and
explain why it offers a better tradeoff between load balance
and scheduling overhead compared with other alternatives.
Section 4 goes into details about the current implementation
in C++ and explains how it can be used to replace OpenMP
for-loops in existing codes. We demonstrate the scheduler's
performance in Section 5 using both a synthetic benchmark
and real for-loops from PRAgMaTIc. Finally, we conclude
the paper in Section 6.

2. BACKGROUND 

OpenMP offers three different scheduling strategies for parallel
loops: static, dynamic and guided [?]. There is also
a more advanced scheduling technique, known as "work-stealing",
which is implemented by libraries such as Intel
Cilk Plus, though it is not part of the OpenMP specification,
nor is it supported (to the best of our knowledge) by
any OpenMP implementation. In this section we will present
these four options and compare them in terms of load balance,
scheduling overhead and overall efficiency.

2.1 OpenMP static 

Under the static scheduling scheme, the iteration space is
divided into equally large chunks which are then assigned to
threads. This can be seen in the first example in Figure 1.
Partitioning of iteration space is done statically at the beginning
of the for-loop, so there is zero scheduling overhead.
On the other hand, this scheme can lead to significant load
imbalance, especially in a highly diverse loop.

2.2 OpenMP dynamic 

Dynamic scheduling is a first approach to the problem of
load imbalance. Instead of a static partitioning of the iteration
space, chunks of work are assigned to threads dynamically.
Once a thread finishes a chunk, it takes the next
available from the iteration space. This is shown in the
middle example in Figure 1. Access to chunks is done via
atomic updates of the loop counter; a thread acquiring a
chunk reads the current value of the loop counter and increments
it atomically by the chunk size.
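The chunk-acquisition step described above can be sketched in a few lines of C++. This is only an illustration of the idea, not the code of any OpenMP runtime: dynamic_for and run_dynamic_demo are hypothetical helper names, and std::thread stands in for the OpenMP thread team.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not part of any OpenMP runtime): runs body(i) for every
// i in [0, n) on nthreads threads, handing out chunks of `chunk` (> 0)
// iterations through one shared atomic counter.
template <typename Body>
void dynamic_for(std::size_t n, std::size_t chunk, unsigned nthreads, Body body) {
    std::atomic<std::size_t> next{0};            // shared loop counter
    auto worker = [&] {
        for (;;) {
            // Atomically claim [begin, begin + chunk); fetch_add returns the old value.
            std::size_t begin = next.fetch_add(chunk, std::memory_order_relaxed);
            if (begin >= n) break;               // iteration space exhausted
            std::size_t end = std::min(begin + chunk, n);
            for (std::size_t i = begin; i < end; ++i) body(i);
        }
    };
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}

// Usage helper: counts how often each iteration was executed.
std::vector<int> run_dynamic_demo(std::size_t n, std::size_t chunk, unsigned nthreads) {
    std::vector<int> hits(n, 0);
    dynamic_for(n, chunk, nthreads, [&](std::size_t i) { ++hits[i]; });
    return hits;
}
```

Every thread hits the same counter once per chunk, which is exactly the shared-variable traffic discussed in the following paragraphs.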

Dynamic scheduling solves imbalance problems as threads
proceed to the next iterations of the for-loop in a fine-grained
way. However, this good load balance comes
at a cost. The loop counter is updated atomically and this
constitutes a twofold source of overhead: instruction latency
and thread competition. First, the time it takes to execute
an atomic instruction can vary anywhere between a standard update in
L1 (if the thread performing the update is running on the
same physical core as the thread which last updated the
shared variable) and an update in RAM (if the last thread
to update the shared variable is running on another socket
in case of NUMA systems). This may not be a problem
in short for-loops, but easily becomes a hotspot in loops
with millions of iterations and little work per iteration (i.e.
when atomic instruction latency is comparable to the loop
body itself). Secondly, as the number of threads increases,
so does the competition for the shared variable, leading to
either (depending on the architecture) increased locking or
an increased number of failed update attempts, thus making
atomic instruction latency even longer.

It could be argued that this overhead can be mitigated by
increasing the chunk size, therefore lowering the number of
times a thread will need to access the loop counter. On
the other hand, increasing the chunk size can introduce load
imbalance once again. Additionally, it is usually impossible
to know the optimal chunk size at compile time and/or it
can vary greatly between successive executions of an algorithm.
Besides, relying on the chunk size for performance
optimization puts an extra burden on the programmer.

We have found that using dynamic scheduling instead of guided
in PRAgMaTIc can increase the execution time of specific
algorithms by up to three times, as will be shown in Section 5.
For that reason, dynamic scheduling was rejected as an
option for that framework.

2.3 OpenMP guided 

The guided scheme is an attempt to reduce dynamic scheduling
overhead while retaining good load balance. The key
difference between the two strategies is how chunks of work
are allocated. Whereas in dynamic scheduling the chunk
size is constant, the guided scheme starts with large chunks
and the size is reduced exponentially as threads proceed to
subsequent iterations of the for-loop. This can be seen in
the last example of Figure 1. Initial large chunks account
for reduced atomic accesses to the loop counter while the
more fine-grained control towards the end of the loop tries
to maintain good load balance.
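To make the shrinking chunk sizes concrete, the sketch below lists the sizes handed out one after another. It assumes each request receives the number of unassigned iterations divided by the number of threads, rounded up and never below a minimum chunk size. The OpenMP specification only requires chunks proportional to that quotient, so the exact rounding here is an assumption and guided_chunks is a hypothetical helper.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Chunk sizes handed out one after another, assuming each request receives
// ceil(remaining / nthreads) iterations but never fewer than min_chunk
// (an illustrative rule; real runtimes may use a different formula).
// Requires nthreads > 0.
std::vector<std::size_t> guided_chunks(std::size_t n, std::size_t nthreads,
                                       std::size_t min_chunk = 1) {
    std::vector<std::size_t> chunks;
    std::size_t remaining = n;
    while (remaining > 0) {
        std::size_t c = (remaining + nthreads - 1) / nthreads;  // ceil division
        c = std::max(c, min_chunk);   // respect the minimum chunk size
        c = std::min(c, remaining);   // never hand out more than is left
        chunks.push_back(c);
        remaining -= c;
    }
    return chunks;
}
```

For 16 iterations and 4 threads this rule yields the sequence 4, 3, 3, 2, 1, 1, 1, 1: few counter accesses early on, fine-grained balancing at the end.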

For the most part, guided scheduling works well in PRAgMaTIc,
yet there are cases where we have observed significant
load imbalance. This can happen if, for instance, most
of the work in an irregular loop is accumulated in a few of the
initial big chunks. In a case like that, threads processing the
"loaded" chunks are busy for a long time while others go through
the remaining "light" iterations relatively quickly and reach
the end of the for-loop early, waiting for the busy workers
to finish as well.

2.4 Work-stealing 

Work-stealing (Bl I13p is a more sophisticated technique
aiming at balancing workload among threads while keeping
scheduling overhead as low as possible. The generic work-stealing
algorithm for a set of tasks [4] can be summarized
as follows. Each thread keeps a deque (double-ended queue)
of tasks to execute. While the deque is not empty, the thread pops
workitems from the front. Once the deque is empty, the
thread becomes a thief, i.e. it chooses a victim thread randomly
and steals a chunk of workitems from the back of the
victim's deque.

For a parallel for-loop with a predefined number of iterations
$N$ the deques can simply be replaced with pairs of indices
$\langle i_{start}, i_{end} \rangle$ corresponding to the range in the iteration
space $[i_{start}, i_{end})$ which has been assigned to each thread,
$0 \le i_{start}, i_{end} \le N$. In this case, every thread executes
iterations by using $i_{start}$ as the loop counter whereas thieves
steal work by decrementing a victim's $i_{end}$.

Figure 1: Example with four threads of the three scheduling strategies offered by OpenMP: static, dynamic
(chunk=2) and guided. Note that under the guided scheme chunk size is reduced exponentially.

Accesses to those pairs of indices can lead to race conditions,
so they need to be accessed with atomics. Following
that, work-stealing for for-loops comes close to OpenMP's
dynamic scheduling with some chunk size > 1, with a major
difference being that in work-stealing threads do not compete
all together for atomic access to the same shared variable
(the common loop counter); instead, congestion is rare
and happens only if two thieves try to steal from the same
victim.
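A minimal sketch of such an index pair follows. Packing both 32-bit indices into one 64-bit word, so that a single compare-and-swap updates the pair consistently, is an assumption made for this example; the text does not say how existing work-stealing libraries implement it. StealableRange and its members are hypothetical names.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative only: a thread's range [start, end) packed into one 64-bit
// word so that owner and thieves can update it with compare-and-swap.
class StealableRange {
    std::atomic<std::uint64_t> bits_;
    static std::uint64_t pack(std::uint32_t s, std::uint32_t e) {
        return (std::uint64_t(s) << 32) | e;
    }
public:
    StealableRange(std::uint32_t s, std::uint32_t e) : bits_(pack(s, e)) {}

    // Owner: claim the next iteration from the front of the range.
    bool take(std::uint32_t& index) {
        std::uint64_t old = bits_.load();
        for (;;) {
            std::uint32_t s = std::uint32_t(old >> 32), e = std::uint32_t(old);
            if (s >= e) return false;                      // range is empty
            // On failure `old` is refreshed and the loop retries.
            if (bits_.compare_exchange_weak(old, pack(s + 1, e))) {
                index = s;
                return true;
            }
        }
    }

    // Thief: steal the back half of the remaining range by lowering `end`.
    bool steal(std::uint32_t& begin, std::uint32_t& end) {
        std::uint64_t old = bits_.load();
        for (;;) {
            std::uint32_t s = std::uint32_t(old >> 32), e = std::uint32_t(old);
            std::uint32_t chunk = (e > s) ? (e - s) / 2 : 0;
            if (chunk == 0) return false;                  // nothing worth stealing
            if (bits_.compare_exchange_weak(old, pack(s, e - chunk))) {
                begin = e - chunk;
                end = e;
                return true;
            }
        }
    }
};
```

Each thread would own one such object, so owner and thieves only contend on the victim's word rather than on a single global counter.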

Performance can still suffer from load imbalance and scheduling
overhead when using work-stealing. The main drawback
of the classic work-stealing algorithm is that thieves choose
victims randomly. There is no heuristic method to indicate
which threads are more suitable victims (i.e. have more remaining
workload) than others. Stealing comes at a cost
and picking victims with too little or no remaining work
is inefficient, as it leads to the need for frequent stealing
which induces some overhead. Additionally, failed attempts
do not help balance the workload. As an example of an
extreme case, a single thread becomes the sole remaining
worker while the rest waste time trying to steal from each
other in vain.

Mitigating the effects of random choice was our main concern
when designing the new for-loop scheduler. We devised
a low-overhead heuristic method for finding appropriate victims.
At the same time, we tried to reduce scheduling overhead
by eliminating the need to use atomics when accessing
each thread's $\langle i_{start}, i_{end} \rangle$ pair. The following section
describes in detail how the scheduler is implemented.

3. INTERRUPT-DRIVEN WORK SHARING 

Our new scheduler differs from existing work-stealing approaches
in two major ways. First of all, as was mentioned in
Section 2.4, every worker constantly "advertises" its progress
so that thieves can find suitable victims which have been left
behind. Secondly, a thief does not actually steal work from
the victim in the classic sense; instead, it interrupts the chosen
victim by sending a POSIX signal. The signal handler
executed by the victim encapsulates the code with which the
victim decides what portion of its remaining workload can
be given away. As becomes apparent, the new scheduling
algorithm is much closer to work-sharing than work-stealing,
therefore we call it Interrupt-Driven Work-Sharing (IDWS).
Nonetheless, we will use work-stealing terminology throughout
this article.

The abstract description of this scheme can be split into 
three parts: 


Algorithm 1 Parallel loop executed by each thread
for $i \leftarrow i_{start}$; $i < i_{end}$; $i \leftarrow i + 1$ do
    flush $i$          ▷ from register file to memory, so that thieves can see this thread's progress
    execute $i$-th iteration
    flush $i_{end}$    ▷ from memory to register file, as it may have been modified by the signal handler
end for


Algorithm 2 Work-stealing
for all threads $t_n$ do
    $remaining_{t_n} \leftarrow i_{end,t_n} - i_{start,t_n} - i_{t_n}$
end for
let $T \leftarrow t_n$ for which $remaining_{t_n} = \max$
send signal to victim $T$
wait for answer
update own $i_{start}$, $i_{end}$
execute loop chunk


Loop execution (Algorithm 1). Every thread executes the
iterations of the chunk it has acquired in the same way as it
would using OpenMP's static scheduling scheme. Initially,
the iteration space is divided statically into chunks of equal
size and every thread $t_n$ is assigned one chunk. The chunk's
boundaries for thread $t_n$, referred to as $i_{start,t_n}$ and $i_{end,t_n}$,
are globally visible variables accessible by all threads. Compared
to static scheduling, the important addition here is
some necessary flushing of the loop counter $i_{t_n}$ and the loop
boundary $i_{end,t_n}$. More precisely, the value of $i_{t_n}$ has to be
written back to memory (instead of being cached in some
register) at the beginning of every iteration so that potential
thieves can monitor $t_n$'s progress, calculate how much
work is left for $t_n$ and decide whether it is worth stealing
from it. Similarly, the end boundary $i_{end,t_n}$ has to be read by
$t_n$ from memory (instead of caching it in some register) before
proceeding to the next iteration because $i_{end,t_n}$ might
have been modified by the signal handler if a thief interrupted
$t_n$ while the latter was executing an iteration of the
for-loop.

Algorithm 3 Signal handling
$remaining \leftarrow i_{end} - i_{start} - i$
if $remaining > 1$ then
    $chunk \leftarrow remaining/2$
    $i_{end,thief} \leftarrow i_{end}$
    $i_{start,thief} \leftarrow i_{end} - chunk$
    $i_{end} \leftarrow i_{end} - chunk$
end if
send reply to thief

Choosing suitable victims (Algorithm 2). By flushing
their loop counters, threads advertise their progress so potential
thieves can find where to steal from. When a thread
becomes a thief, it calculates the remaining workload for all
other threads by reading their advertised loop counters
together with the associated values $i_{start,t_n}$
and $i_{end,t_n}$. This way, we have a heuristic method for finding
which thread has the most remaining work, thus being a
more suitable victim than others. This heuristic may not be
optimal, but is an improvement over random choice. Once
the thief has spotted its victim, it sends a signal and waits
for an answer. The victim executes the signal handler and
replies with the boundaries (a pair $\langle i_{start}, i_{end} \rangle$) of the
chunk it wants to give away. Finally, the thief becomes a
worker once again and moves on to process the newly acquired
chunk.
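The victim-selection heuristic can be sketched as a plain function over a snapshot of the other threads' advertised state. The Progress struct and choose_victim are hypothetical names for this illustration, and the snapshot is taken without any synchronisation, so the numbers are only estimates.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical snapshot of what a thief reads from another thread.
struct Progress {
    std::size_t start;      // first iteration of the thread's current chunk
    std::size_t end;        // one past the last iteration of the chunk
    std::size_t processed;  // iterations of the chunk already started
    bool active;            // false if the thread is itself a thief
};

// Returns the index of the active thread (other than `self`) with the most
// remaining iterations, or -1 if no thread has any work left to share.
int choose_victim(const std::vector<Progress>& threads, int self) {
    int victim = -1;
    std::size_t best = 0;
    for (int t = 0; t < static_cast<int>(threads.size()); ++t) {
        const Progress& p = threads[t];
        if (t == self || !p.active) continue;   // skip ourselves and other thieves
        std::size_t span = p.end - p.start;
        // Guard against a stale `processed` value that exceeds the span.
        std::size_t remaining = span > p.processed ? span - p.processed : 0;
        if (remaining > best) { best = remaining; victim = t; }
    }
    return victim;
}
```

The scan is linear in the number of threads and touches no shared counter atomically, which is where the low overhead of the heuristic comes from.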

Signal handler (Algorithm 3). When a victim is interrupted
by the signal, control is transferred by the operating
system to the associated handler. Inside that code,
the thread calculates how much work it can give away (if
any), replies to the thief with the boundaries of the donated
chunk, re-adjusts the boundaries of its own chunk and finally
returns from the signal handler.
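The arithmetic of the donation can be sketched in isolation from the signalling machinery. Range and donate_half are hypothetical names, and the parameter `next` (the first iteration the victim has not started yet) is an assumed way of expressing the victim's progress for this example.

```cpp
#include <cstddef>

struct Range { std::size_t start, end; };   // half-open interval [start, end)

// Sketch of the split performed by the victim. `next` is the first iteration
// the victim has not started yet (an assumption of this example). On success
// the victim keeps [victim.start, victim.end - chunk), `gift` receives the
// back half of the unprocessed work, and true is returned.
bool donate_half(Range& victim, std::size_t next, Range& gift) {
    std::size_t remaining = victim.end > next ? victim.end - next : 0;
    if (remaining <= 1) return false;        // too little work to share
    std::size_t chunk = remaining / 2;
    gift.end = victim.end;                   // thief takes the tail of the range
    gift.start = victim.end - chunk;
    victim.end -= chunk;                     // victim keeps the head
    return true;
}
```

For a victim owning [0, 100) that has reached iteration 40, the thief would receive [70, 100) and the victim would continue on [40, 70).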

It is clear that there are no races and no need for atomic 
access to any loop variables during the stealing process, as 
the donor is the one who decides what will be donated. Of 
course, switching from user to kernel mode to execute the 
signalling system call and busy-waiting for a reply from the 
victim involves some overhead; however, as will be shown in 
the results section, this method seems to be more efficient 
than classic work stealing. 

4. C++ IMPLEMENTATION AND USAGE 

This section describes how the IDWS scheduler is implemented
and how it can replace existing OpenMP for-loops.

4.1 IDWS namespace 

IDWS is a namespace encapsulating all necessary data structures
and functions used by the new scheduler. Its declaration
can be seen in Code Snippet 1.


struct IDWS::thread_state_t. The heart of IDWS is a struct
named thread_state_t, which encapsulates all variables involved
in parallel loop execution and work-stealing. Each
participating thread has its own instance of this struct, which
is accessible by all other threads. The struct can be seen in
Code Snippet 2.


namespace IDWS{
    struct thread_state_t;
    vector<thread_state_t> thread_state;
    void SIGhandlerUSR1(int sig);
    void IDWS_Initialize();
    void IDWS_Finalize();
}

Code Snippet 1: IDWS namespace. It consists of initialisation
and finalisation functions, the signal handler, the definition
of struct thread_state_t and the vector holding all
thread_state_t instances (one per thread).


struct thread_state_t{
    size_t start;
    size_t end;
    size_t processed;
    int current_ctx;
    bool active;

    int signal_arg;
    pthread_t ptid;
    pthread_mutex_t comm_lock;
    pthread_mutex_t request_mutex;
    pthread_cond_t wait_for_answer;
};

Code Snippet 2: thread_state_t struct 

• start and end: Define the current chunk boundaries

• processed: Is used by a thread to advertise its progress
through the loop

• current_ctx: IDWS loops are nowait loops, which
means that a thread can proceed to the rest of the
program without synchronising with other threads. In
order to know whether two threads work inside the
same loop, so that stealing work from one another is valid,
a counter current_ctx is used, which is incremented
each time a thread finishes a loop. Here we assume
that all threads will go through all loops of the program.

• active: Indicates whether the thread is inside the
loop; this variable is used by thieves to immediately skip
threads which have also become thieves.

• signal_arg: POSIX signals can only have two arguments:
which signal is to be sent and to whom. The victim
needs to know, however, who the thief is, so signal_arg
is used by the thief to send its ID to the victim.

• comm_lock: In order to avoid needless busy-waiting by
other thieves while one thief has already sent a signal
to its victim, we use a lock (in the form of a mutex); while
this lock is held by a thief, other thieves will choose
other victims to steal from.

• ptid: POSIX ID of the thread; it is used by the thief
to raise the signal.

• request_mutex and wait_for_answer: POSIX mutex
and condition variables which assist the process of sending
the signal and waiting for a reply.

Note that we need two separate mutexes and cannot use
request_mutex in place of comm_lock. The former is implicitly
released by the thief (while it waits on the condition variable)
in order to enable the victim to signal
the condition variable; in the meantime, before the victim
locks request_mutex, another thief might acquire the lock
and disrupt the exchange.


vector IDWS::thread_state. Each thread has its own instance
of the thread_state_t struct. All instances are held
in a shared vector called thread_state.


Initialisation and finalisation. Like MPI, IDWS needs to
be initialised by calling IDWS::IDWS_Initialize(). Threads
must create their thread_state_t structs and push them
back into the shared vector thread_state. Struct initialisation
includes finding POSIX IDs and initialising comm_lock,
request_mutex and wait_for_answer. Similarly, this data
has to be destroyed at the end of the program, which is done
by a call to IDWS::IDWS_Finalize(). Additionally, we must
register a signal handler to serve the interrupt. We have
chosen signal SIGUSR1 and function IDWS::SIGhandlerUSR1
as the signal handler. Choice of SIGUSR1 was arbitrary; it
should be noted, however, that if an application uses the
same signal for other purposes, it must re-register the original
handler upon finishing with IDWS or use a different
signal in the first place.
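Registering a handler for SIGUSR1 and later putting the application's original handler back can be done with the POSIX sigaction call, as in the sketch below. The names demo_handler, install_handler and restore_handler are hypothetical; this is not the body of IDWS_Initialize or IDWS_Finalize, which are not shown in the text.

```cpp
#include <csignal>
#include <cstring>
#include <signal.h>

static volatile sig_atomic_t g_sigusr1_seen = 0;   // set by the demo handler

// Stand-in for the real handler; it only records that the signal arrived.
extern "C" void demo_handler(int) { g_sigusr1_seen = 1; }

static struct sigaction g_previous;   // whatever was registered before us

// Install the demo handler for SIGUSR1, remembering the old disposition.
bool install_handler() {
    struct sigaction sa;
    std::memset(&sa, 0, sizeof sa);
    sa.sa_handler = demo_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;          // restart interrupted system calls
    return sigaction(SIGUSR1, &sa, &g_previous) == 0;
}

// Put the original handler back, e.g. at finalisation time.
bool restore_handler() {
    return sigaction(SIGUSR1, &g_previous, nullptr) == 0;
}
```

Saving the previous disposition in the third argument of sigaction is what lets an application that already uses SIGUSR1 get its own handler back afterwards.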


Signal handler. A victim decides what portion of its chunk
can be donated by executing the signal handler. The way
it is done is described in Code Snippet 3. The victim first
checks that the thief works in the same context. Then, it
calculates how much work it can give away using start, end
and processed, also leaving a safety margin due to an uncertainty
regarding the true value of processed. In case
of success, the victim updates both the thief's and its own
start and end and sets signal_arg=-1 to indicate successful donation
(otherwise, signal_arg is set to another value). Finally,
the victim signals the condition variable to let the thief know
that the signal handler is over.

4.2 Prologue and epilogue macros 

The new scheduler is defined in two parts, using macros 
IDWS_prologue and IDWS_epilogue. These macros must 
surround the loop body. 


IDWS_prologue macro. Before entering a loop, the iteration
space is split into equal chunks which are assigned to
threads. After that, each thread begins the execution of its
chunk. Compared to a standard for-loop, an IDWS for-loop
is defined slightly differently. Apart from checking for the
end of the loop and incrementing the counter after every iteration,
in IDWS we must also force the compiler to flush
the counter back to memory and load the updated value of
$i_{end}$ from memory (which might have been modified by the
signal handler), as indicated by Algorithm 1. Flushing is
done selectively for those two variables by casting them to
volatile datatypes. Using #pragma omp flush would flush
the entire shared program state, which is not efficient.
Pseudo-code of how the macro expands is given in Code
Snippet 4. Parameters TYPE, NAME and SIZE correspond to
the datatype of the loop counter, its name and the size of
the iteration space, respectively. In the current implementation
of the new scheduler we assume that loops run from
0 to SIZE with increments of 1 and that the loop counter is
of an unsigned integral datatype.
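The selective flushing through volatile-qualified accesses can be tried out in isolation with the sketch below. LoopState and run_chunk are hypothetical names, and the callback that shrinks the end boundary merely stands in for the signal handler, so no real signal is involved.

```cpp
#include <cstddef>

// Hypothetical, stripped-down per-thread state (names are illustrative).
struct LoopState {
    std::size_t start, end, processed;
};

// Runs iterations [s.start, s.end), publishing progress and re-reading the end
// boundary through volatile-qualified accesses on every iteration.
// `body(i, s)` may shrink s.end, standing in for the signal handler here.
template <typename Body>
std::size_t run_chunk(LoopState& s, Body body) {
    std::size_t cnt = 0;
    for (std::size_t i = s.start; ; ++i, ++cnt) {
        // Volatile store: the compiler must emit this write each time.
        *static_cast<volatile std::size_t*>(&s.processed) = cnt;
        // Volatile load: the compiler must not keep s.end in a register.
        if (i >= *static_cast<volatile std::size_t*>(&s.end)) break;
        body(i, s);
    }
    return cnt;   // number of iterations actually executed
}
```

Note that volatile only constrains the compiler; it does not by itself order memory accesses between cores, so any cross-thread visibility guarantee has to come from elsewhere.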


IDWS_epilogue macro. After a thread finishes its chunk,
it becomes a thief. That means it has to enter the stealing
process, as described in Algorithm 2. The IDWS_epilogue
macro serves this purpose. The way the macro expands can
be seen in Code Snippet 4. The thief calculates for all active
workers the amount of remaining work. Then, starting from
the worker with the highest remaining workload, it tries to
acquire the worker's comm_lock. If no suitable worker is
found, then the thief exits the IDWS loop and proceeds to
the rest of the code. Otherwise, the thief locks the victim's
mutex, sends the signal and waits on the victim's condition
variable for an answer. The answer comes via signal_arg.
If signal_arg==-1, then the victim has set the thief's start
and end variables, so the thief becomes a worker again. If
any other answer has been sent back, then the thief exits
the IDWS loop. It is important to note that a memory
fence is necessary on the thief's side between setting the
victim's signal argument signal_arg and raising the signal, so
that we make sure that the victim will see the correct value
of signal_arg. Locking the victim's mutex before sending the
signal works as an implicit memory fence.
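The request/reply pattern around a thread-directed signal can be sketched as follows. To keep the example short it deliberately swaps the mutex and condition variable for lock-free atomics and a spin-wait, so it is not the IDWS handshake itself; all names (g_request, g_reply, on_sigusr1, victim_main, ask_victim) are hypothetical.

```cpp
#include <atomic>
#include <cstring>
#include <pthread.h>
#include <signal.h>

static std::atomic<int> g_request{0};    // who is asking (written by the thief)
static std::atomic<int> g_reply{0};      // answer (written inside the handler)
static std::atomic<bool> g_stop{false};  // tells the victim thread to finish

// Runs on the victim thread when SIGUSR1 is delivered to it.
extern "C" void on_sigusr1(int) {
    // Reply with the negated thief ID so the thief can tell it was seen.
    g_reply.store(-g_request.load(std::memory_order_acquire),
                  std::memory_order_release);
}

static void* victim_main(void*) {
    while (!g_stop.load(std::memory_order_acquire)) { /* pretend to work */ }
    return nullptr;
}

// Thief side: publish the request, interrupt the victim, spin for the reply.
int ask_victim(int thief_id) {
    struct sigaction sa;
    std::memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, nullptr);

    pthread_t victim;
    pthread_create(&victim, nullptr, victim_main, nullptr);

    g_request.store(thief_id, std::memory_order_release);  // before the signal
    pthread_kill(victim, SIGUSR1);                          // thread-directed signal
    int reply;
    while ((reply = g_reply.load(std::memory_order_acquire)) == 0) { }

    g_stop.store(true, std::memory_order_release);
    pthread_join(victim, nullptr);
    return reply;
}
```

The release store of the request before pthread_kill plays the role that locking the victim's mutex plays in the text: it ensures the handler observes the thief's ID.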

4.3 OpenMP to IDWS 


#include <omp.h>

int main(){
    #pragma omp parallel
    {
        #pragma omp for
        for(TYPE VAR = 0; VAR < SIZE; ++VAR){
            do_something(VAR);
        }
    }
}

Code Snippet 5: Initial OpenMP for-loop. The loop must 
be inside an OMP parallel region. 

The new scheduler can be used directly with virtually any
C++ OpenMP application written for any POSIX-compliant
operating system (provided that the compiler used implements
OpenMP upon pthreads). A prerequisite for converting
an OpenMP loop to an IDWS one is that the former is
written as shown in Code Snippet 5, i.e. the loop must
be inside an omp parallel region. Conversion to IDWS
loops is shown in Code Snippet 6. The user needs to include
header file "IDWS.h" which can be downloaded from PRAgMaTIc's
page on Launchpad^1. This header file defines the
IDWS namespace and the prologue and epilogue macros.

^1 https://code.launchpad.net/~gr409/pragmatic/IDWS
void SIGhandlerUSR1(int sig){
    int tid = omp_get_thread_num();
    int sig_thread = thread_state[tid].signal_arg; // Who sent the signal

    pthread_mutex_lock(&thread_state[tid].request_mutex);

    // Only share a chunk if both threads are in the same context
    if(thread_state[tid].current_ctx == thread_state[sig_thread].current_ctx){
        size_t remaining = thread_state[tid].end - thread_state[tid].start -
                           thread_state[tid].processed;

        // Leave a safety margin: we do not know if the signal was caught before,
        // after or even in the middle of updating thread_state[tid].processed.
        if(remaining > 0){
            --remaining;
            size_t chunk = remaining/2;

            thread_state[sig_thread].start = thread_state[tid].end - chunk;
            thread_state[sig_thread].end = thread_state[tid].end;
            thread_state[tid].end -= chunk;

            thread_state[tid].signal_arg = -1; // reply success
        }else thread_state[tid].signal_arg = -2; // reply failure
    }else thread_state[tid].signal_arg = -2; // different context: reply failure

    pthread_cond_signal(&thread_state[tid].wait_for_answer);
    pthread_mutex_unlock(&thread_state[tid].request_mutex);
}

Code Snippet 3: Signal handler. 


/* IDWS_prologue(TYPE, NAME, SIZE) starts expanding here */

// assume tid = omp_get_thread_num();
thread_state[tid].start = ...; thread_state[tid].end = ...;
thread_state[tid].processed = 0; thread_state[tid].active = true;

do{
    size_t __IDWS_cnt = 0;
    for(TYPE NAME = thread_state[tid].start; ; ++NAME, ++__IDWS_cnt){
        // Force flushing the progress back into memory
        *((volatile size_t *) &thread_state[tid].processed) = __IDWS_cnt;

        // Force re-loading the end boundary from memory
        if(NAME >= *((volatile size_t *) &thread_state[tid].end)) break;

/* IDWS_prologue ends here */

        /* loop body is executed here */

/* IDWS_epilogue starts expanding here */

    } // end for

    thread_state[tid].active = false; // become a thief
    std::map<int, size_t> remaining;
    forall(t in active threads) // only check non-thieves
        remaining[t] = thread_state[t].end - thread_state[t].start - thread_state[t].processed;
    traverse remaining from largest to smallest;
    victim = first thread t for which pthread_mutex_trylock(&thread_state[t].comm_lock) succeeds;
    if(no victim found) break; // exit the do-while loop

    thread_state[victim].signal_arg = tid; // tell the victim who we are
    pthread_mutex_lock(&thread_state[victim].request_mutex);
    pthread_kill(thread_state[victim].ptid, SIGUSR1); // send signal
    pthread_cond_wait(&thread_state[victim].wait_for_answer, &thread_state[victim].request_mutex);
    pthread_mutex_unlock(&thread_state[victim].request_mutex);

    if(thread_state[victim].signal_arg == -1) thread_state[tid].active = true; // become a worker again
    pthread_mutex_unlock(&thread_state[victim].comm_lock);
} while(thread_state[tid].active == true); // end do

thread_state[tid].current_ctx++; // proceed to next loop
/* IDWS_epilogue ends here */


Code Snippet 4: Pseudo-code demonstrating how IDWS_prologue and IDWS_epilogue are expanded around the loop body. 
#include <omp.h>
#include "IDWS.h"

int main(){
    IDWS::IDWS_Initialize();
    int nthreads = omp_get_max_threads();

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();

        IDWS_prologue(TYPE, VAR, SIZE)
            do_something(VAR);
        IDWS_epilogue
    }

    IDWS::IDWS_Finalize();
}

Code Snippet 6: Transformed code showing what has to be 
added/modified in order to use the new scheduler instead of 
a standard OpenMP scheduling strategy. 


Figure 3: Relative execution time (lower is better)
between IDWS, OpenMP guided and Cilk Plus on
the Intel Xeon E5-2643 system. For each benchmark,
the fastest scheduling strategy is taken as reference
(scoring 1.0 on the y-axis).


Compared to the initial version, we need to define:

• int nthreads=omp_get_max_threads(): shared variable
outside the parallel region

• int tid=omp_get_thread_num(): thread-private variable
inside the parallel region

We then remove the #pragma omp for directive and the for-loop
declaration and, finally, surround the loop body with the
IDWS_prologue and IDWS_epilogue macros.
Figure 2: Relative execution time (lower is better)
between IDWS, OpenMP guided and Cilk Plus on
the Intel Xeon E5-2650 system. For each benchmark,
the fastest scheduling strategy is taken as reference
(scoring 1.0 on the y-axis).


In order to measure the performance of our new scheduler,
we ran a series of tests using both synthetic benchmarks
and real kernels from the PRAgMaTIc framework. We used
three systems: a dual-socket Intel Xeon E5-2650 (Sandy
Bridge, 2.00GHz, 8 physical cores per socket, 2 hyperthreads
per core, 32 threads in total), a dual-socket Intel Xeon E5-2643
(Sandy Bridge, 3.30GHz, 4 physical cores per socket,
2 hyperthreads per core, 16 threads in total) and an Intel
Xeon Phi Ob/01 board (1.2GHz, 61 physical cores, 4 hyperthreads
per core, 244 threads in total). The two Xeon
systems run Red Hat Enterprise Linux Server release 6.4
(Santiago). Both versions of the code (intel64 and mic) were
compiled with Intel Composer XE 2013 SP1 using the -O3
optimisation flag. The benchmarks were run using Intel's
thread-core affinity support with the maximum number of
available threads on each platform. Additionally, we ran
a second series of benchmarks on Xeon Phi using half the
available number of threads, more specifically using all
61 cores with 2 hyperthreads per core. We did so because
we have observed that for some codes Xeon Phi performs
best when using this threading configuration.

The synthetic benchmark was designed to be compute-bound
with minimal memory traffic and no thread synchronization.
Our purpose is to show how the different scheduling
strategies compare to each other in terms of achievable
load balance and incurred scheduling overhead without being
affected by other factors (such as memory bandwidth,
data locality etc.). The synthetic benchmark uses an array
int states[16M], which is populated with values in the
range [0..3]. Then, the parallel loop iterates over this array.
For each element i, i ∈ [0..16M), the kernel performs a different
amount of work according to the value of states[i]. If
states[i]==0, nothing is done. If states[i]==1, the kernel
computes sin() values of i and powers of i. If states[i]==2,
the kernel additionally computes cos() values of i and its
powers. Finally, if states[i]==3, the kernel additionally
computes some sinh() values.
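A kernel of this shape could look like the sketch below. The text does not say how many sin(), cos() or sinh() values are computed per iteration or how the arguments are formed, so the repetition count `reps`, the scaling of i into [0, 1) and the name synthetic_work are all assumptions made for illustration.

```cpp
#include <cmath>

// One iteration of a benchmark-like kernel. How many sin/cos/sinh values the
// original computes is not stated; `reps` is a made-up knob for this sketch.
double synthetic_work(unsigned i, int state, int reps = 8) {
    double acc = 0.0;
    if (state == 0) return acc;                    // state 0: no work
    // Scale i into [0, 1) so that the powers stay bounded (an assumption).
    const double x = static_cast<double>(i % 1000) / 1000.0;
    for (int k = 1; k <= reps; ++k) {
        acc += std::sin(std::pow(x, k));                    // states 1, 2 and 3
        if (state >= 2) acc += std::cos(std::pow(x, k));    // states 2 and 3
        if (state >= 3) acc += std::sinh(std::pow(x, k));   // state 3 only
    }
    return acc;
}
```

Returning the accumulated value keeps the compiler from discarding the arithmetic, which matters for a benchmark whose only purpose is to burn a state-dependent amount of compute time.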

Figure 4: Relative execution time (lower is better)
between IDWS, OpenMP guided and Cilk Plus on
Intel Xeon Phi using 122 threads. For each benchmark,
the fastest scheduling strategy is taken as reference
(scoring 1.0 on the y-axis).

Array states is populated five times with different distributions
of workload and total amount of work. Each population
has been given a name:

• Regular: All elements of states are set equal to 2. 
This is a distribution corresponding to a regular loop 
which does the exact same thing in every iteration. 

• Random: states is populated with random values 
following a uniform distribution. This sub-benchmark 
corresponds to real-life distributions in problems like 
graph coloring or the swap and smooth kernels in 
PRAgMaTIc. 

• Dense End: Most of the workload is accumulated 
towards the end of the iteration space, where states[i]==3, 
while the beginning is populated with a uniform 
mixture of values [0..3). The rest of the iteration space is 
set to 0, i.e. no work. This is a distribution closely 
related to the refinement kernel in PRAgMaTIc. 

• Dense Start: Mirrored distribution of Dense End. Closely 
related to PRAgMaTIc’s coarsening kernel. This is 
an example of workload distribution for which OpenMP 
guided scheduling is a bad choice, since guided scheduling 
assigns its largest chunks first, which is where the heavy 
iterations are concentrated. 

• Periodic: There is a repeating pattern of states 
throughout the iteration space. It is particularly bad for static 
scheduling with interleaved allocations of iterations (i.e. 
with some chunk size). 
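
The five populations listed above could be generated along the following lines. This is only a sketch: the text does not state the sizes of the mixed and dense regions, the length of the periodic pattern or the random seed, so kMixFraction, kDenseFraction, kPeriod and the default seed are hypothetical values chosen for illustration. The helper cyclic_load is likewise an assumption; it uses the state value of an iteration as a stand-in for its cost in order to show how schedule(static,1) distributes work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

enum class Dist { Regular, Random, DenseEnd, DenseStart, Periodic };

// Hypothetical parameters; the text does not give the actual values.
constexpr double kMixFraction   = 0.10;  // share of the space holding the 0..2 mixture
constexpr double kDenseFraction = 0.10;  // share of the space set to 3
constexpr std::size_t kPeriod   = 16;    // length of the repeating pattern

std::vector<int> populate(Dist d, std::size_t n, unsigned seed = 42) {
    std::vector<int> s(n, 0);
    std::mt19937 rng(seed);
    switch (d) {
    case Dist::Regular:                  // same work in every iteration
        std::fill(s.begin(), s.end(), 2);
        break;
    case Dist::Random: {                 // uniform over 0..3
        std::uniform_int_distribution<int> u(0, 3);
        for (int& v : s) v = u(rng);
        break;
    }
    case Dist::DenseEnd: {               // light mixture first, heavy block last
        std::uniform_int_distribution<int> u(0, 2);
        const std::size_t mix   = static_cast<std::size_t>(n * kMixFraction);
        const std::size_t dense = static_cast<std::size_t>(n * kDenseFraction);
        for (std::size_t i = 0; i < mix; ++i) s[i] = u(rng);
        std::fill(s.end() - static_cast<std::ptrdiff_t>(dense), s.end(), 3);
        break;
    }
    case Dist::DenseStart:               // mirror image of Dense End
        s = populate(Dist::DenseEnd, n, seed);
        std::reverse(s.begin(), s.end());
        break;
    case Dist::Periodic:                 // one heavy iteration every kPeriod
        for (std::size_t i = 0; i < n; i += kPeriod) s[i] = 3;
        break;
    }
    return s;
}

// Work per thread under schedule(static,1): iteration i goes to thread
// i % threads. The state value is used as a proxy for iteration cost.
std::vector<long> cyclic_load(const std::vector<int>& s, std::size_t threads) {
    std::vector<long> load(threads, 0);
    for (std::size_t i = 0; i < s.size(); ++i) load[i % threads] += s[i];
    return load;
}
```

With this particular pattern, calling cyclic_load(populate(Dist::Periodic, n), 16) puts all of the work on thread 0 and none on the other fifteen, which is the kind of worst case the Periodic distribution is meant to expose for interleaved static scheduling.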

Apart from the synthetic benchmark, we also ran 
PRAgMaTIc using the various scheduling options in order to see 
how each strategy performs in a real-life scenario, where 
compute capacity is not the only performance-limiting 
factor. It should be noted that PRAgMaTIc is built upon 
OpenMP, so there are no results for Cilk+ in this case. 

Figure 5: Relative execution time (lower is better) 
between IDWS, OpenMP guided and Cilk Plus on 
Intel Xeon Phi using 244 threads. For each benchmark, 
the fastest scheduling strategy is taken as reference 
(scoring 1.0 on the y-axis). 

Table 1, Table 2 and Tables 3 & 4 show the execution time 
on the three platforms, respectively, using six scheduling 
strategies for each distribution of the synthetic benchmark 
and the four PRAgMaTIc kernels. The strategy named 
“OMP static,1” is static scheduling with chunk size equal 
to 1. As can be seen, IDWS is either the fastest scheduling 
option or very close to the fastest for each 
benchmark-platform combination. Additionally, it clearly outperforms 
Cilk Plus, with the performance gap becoming wider as the 
number of threads increases and Cilk’s design to pick 
victims in a random fashion becomes inefficient. Those results 
are a good indication that IDWS is future-proof and ready 
for the thousand-core era. 

Regarding PRAgMaTIc, IDWS’s major competitor seems 
to be OpenMP’s guided scheduling. Although in theory it is 
not well suited to certain kernels (coarsening), in practice 
it performs just as well as IDWS. A notable exception is 
the 244-thread case on Xeon Phi, where guided scheduling 
is the worst choice among the available options. 
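
The theoretical weakness of guided scheduling on front-loaded workloads can be seen from how it sizes its chunks. The sketch below simulates a simplified guided schedule, assuming each chunk is the ceiling of the remaining iterations divided by the number of threads. Real OpenMP runtimes use implementation-specific variants of this rule, so the numbers are illustrative only. The function names and the use of the state value as a cost proxy are assumptions of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Chunk sizes handed out by a simplified guided schedule.
// Assumption: each chunk is ceil(remaining / threads), never smaller than
// min_chunk; actual runtimes may use a different proportionality factor.
std::vector<std::size_t> guided_chunks(std::size_t n, std::size_t threads,
                                       std::size_t min_chunk = 1) {
    if (threads == 0) threads = 1;
    if (min_chunk == 0) min_chunk = 1;
    std::vector<std::size_t> chunks;
    std::size_t remaining = n;
    while (remaining > 0) {
        std::size_t c = (remaining + threads - 1) / threads;  // ceiling division
        if (c < min_chunk) c = min_chunk;
        if (c > remaining) c = remaining;
        chunks.push_back(c);
        remaining -= c;
    }
    return chunks;
}

// Fraction of the total work that lands in the very first chunk, using the
// state value of each iteration as a stand-in for its cost.
double first_chunk_share(const std::vector<int>& states, std::size_t threads) {
    if (states.empty()) return 0.0;
    const std::size_t first = guided_chunks(states.size(), threads).front();
    long head = 0, total = 0;
    for (std::size_t i = 0; i < states.size(); ++i) {
        total += states[i];
        if (i < first) head += states[i];
    }
    if (total == 0) return 0.0;
    return static_cast<double>(head) / static_cast<double>(total);
}
```

For 100 iterations and 4 threads this model yields chunks of 25, 19, 14, 11 and so on down to 1. If the heavy iterations all sit within the first 20 positions, first_chunk_share returns 1.0: a single thread receives the entire workload while the others are handed chunks with nothing to do.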

A comparison of relative performance between the three 
major competitors (IDWS, OpenMP guided and Cilk Plus) is 
shown in Figure 2 (Intel Xeon E5-2650 system), Figure 3 
(Intel Xeon E5-2643 system), Figure 4 (Intel Xeon Phi with 
122 threads) and Figure 5 (Intel Xeon Phi with 244 threads). 
For each benchmark, we compare the relative execution time 
between IDWS, OpenMP guided and Cilk Plus (for 
PRAgMaTIc kernels there is only an IDWS vs OpenMP guided 
comparison). The reference execution time per benchmark, i.e. the 
one which corresponds to 1.0 on the y-axis, is the execution 
time of the fastest scheduler. 
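
The normalisation used in these figures amounts to dividing each scheduler's execution time by the smallest time measured for that benchmark. A minimal sketch follows; the function name relative_times is made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Divides every execution time by the smallest one, so that the fastest
// scheduler scores exactly 1.0 and every other scheduler scores above it.
std::vector<double> relative_times(const std::vector<double>& seconds) {
    std::vector<double> rel;
    if (seconds.empty()) return rel;
    const double fastest = *std::min_element(seconds.begin(), seconds.end());
    rel.reserve(seconds.size());
    for (double t : seconds) rel.push_back(t / fastest);
    return rel;
}
```

Taking the Dense Begin column of Table 1 as input (IDWS 3.80 s, OpenMP guided 7.08 s, Cilk Plus 4.00 s), relative_times({3.80, 7.08, 4.00}) gives roughly 1.00, 1.86 and 1.05.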

6. CONCLUSIONS AND FUTURE WORK 

We have presented an Interrupt-Driven Work-Sharing 
for-loop scheduler which is based on work-stealing principles and 
tries to address a major problem of the original work-stealing 
algorithm: the random choice of victims. The first 
implementation of IDWS works very efficiently, outperforming Intel 
Cilk Plus, while ranging from slightly slower to considerably 
faster than the best (per kernel) OpenMP scheduling 
strategy. These results indicate that IDWS could become the 
universal default scheduler for OpenMP for-loops, freeing 
the programmer from tricky and disruptive management of 
load balance. 

Two main points of focus for further work should be data 
locality and efficiency of work-sharing. Work on locality 
issues has been published by several groups (H m uni n]), 
whereas Adnan and Sato have presented interesting ideas on 
efficient work-stealing strategies [2], some of which could be 
applicable to our work-sharing scheduler. 

               Synthetic benchmark                                PRAgMaTIc kernels
               Regular  Random  Dense End  Dense Begin  Periodic  Coarsen  Refine  Swap  Smooth
IDWS           5.85     7.48    4.14       3.80         1.13      12.4     7.29    19.9  11.0
OMP static     5.84     7.51    15.7       15.0         1.13      18.6     7.87    20.4  12.1
OMP static,1   5.92     7.60    4.32       3.91         14.5      15.4     8.45    22.5  12.4
OMP dynamic    22.4     27.0    21.4       20.5         9.72      67.4     31.4    99.7  17.9
OMP guided     5.82     7.46    4.27       7.08         1.12      12.1     6.88    19.5  11.1
Cilk+          6.12     7.76    4.35       4.00         1.20      -        -       -     -

Table 1: Execution time in seconds for each benchmark using the 6 different scheduling strategies on a 
dual-socket Intel Xeon E5-2650 (Sandy Bridge, 2.00GHz, 8 physical cores per socket, 16 hyperthreads per socket, 
32 threads in total). 



               Synthetic benchmark                                PRAgMaTIc kernels
               Regular  Random  Dense End  Dense Begin  Periodic  Coarsen  Refine  Swap  Smooth
IDWS           8.24     10.6    5.86       5.37         [missing] 17.5     7.06    24.5  17.3
OMP static     8.25     10.6    21.2       22.7         [missing] 29.3     8.13    26.8  18.3
OMP static,1   8.37     10.7    6.01       5.70         22.2      23.0     9.55    30.7  19.1
OMP dynamic    26.9     34.6    22.8       21.7         9.43      63.7     26.3    91.8  22.7
OMP guided     8.23     10.5    6.07       9.81         2.84      17.6     7.08    24.3  17.3
Cilk+          8.38     10.7    5.96       5.47         2.92      -        -       -     -

Table 2: Execution time in seconds for each benchmark using the 6 different scheduling strategies on a 
dual-socket Intel Xeon E5-2643 (Sandy Bridge, 3.30GHz, 4 physical cores per socket, 8 hyperthreads per socket, 
16 threads in total). 



               Synthetic benchmark                                PRAgMaTIc kernels
               Regular  Random  Dense End  Dense Begin  Periodic  Coarsen  Refine  Swap         Smooth
IDWS           11.0     19.6    9.60       9.02         0.97      30.7     17.2    86.7         26.9
OMP static     12.3     22.3    49.0       54.2         1.06      34.7     19.7    79.1         27.7
OMP static,1   12.6     22.9    11.2       10.6         48.3      35.6     21.2    122          26.2
OMP dynamic    40.1     31.5    25.7       25.1         15.2      129      59.6    234          29.4
OMP guided     10.8     19.6    10.3       23.5         0.92      29.9     15.6    [illegible]  24.0
Cilk+          11.3     19.9    10.1       9.59         1.05      -        -       -            -

Table 3: Execution time in seconds for each benchmark using the 6 different scheduling strategies on Xeon 
Phi (1.2GHz, 61 physical cores, 2 hyperthreads per core, 122 threads in total). 



               Synthetic benchmark                                PRAgMaTIc kernels
               Regular  Random  Dense End  Dense Begin  Periodic  Coarsen  Refine  Swap  Smooth
IDWS           7.46     15.9    7.46       7.06         0.56      34.2     19.9    177   25.9
OMP static     8.13     16.5    27.0       27.1         0.51      29.1     21.3    174   19.4
OMP static,1   7.65     16.0    7.63       7.24         24.7      27.9     20.0    202   19.3
OMP dynamic    17.3     19.8    13.9       13.6         7.68      81.4     38.4    247   24.4
OMP guided     7.27     15.7    8.11       19.6         0.52      96.1     35.1    275   46.6
Cilk+          8.03     16.3    8.31       7.97         0.63      -        -       -     -

Table 4: Execution time in seconds for each benchmark using the 6 different scheduling strategies on Xeon 
Phi (1.2GHz, 61 physical cores, 4 hyperthreads per core, 244 threads in total). 